HaiBiostat: Factor Analysis applied in Multivariate Regression: Machine Learning's Perspective

Hai Nguyen

Questions

1- What is difference between EFA and PCA?
2- Is it possible that we can run the regression analysis on factors generating from Factor Analysis?
3- If so, how can we run it?

The demonstration is to use the dataset Factor-Hair-Revised.csv to build a regression model to predict satisfaction.

EFA vs. PCA

	FA	PCA
Meaning	A factor (or latent) is a common or underlying element with which several other variables are correlated \(X_1 = L_1\times F + \epsilon_1\) \(X_2 = L_2\times F + \epsilon_2\) \(X_3 = L_3\times F + \epsilon_3\) where, F is the factor, W is are the loading	A component is a derived new dimension (or variable) so that the derived variables are linearly independent of each other \(Y = W_1\times PC_1 + W_2\times PC_2 + ...+ W_{10}\times PC_{10} + C\) where, PCs are the components & W is are the weights
Process	The original variables are defined as the linear combinations of the factors	The components are calculated as the linear combinations of the original variables
Objective	FA focuses on explaining the covariances or the correlations between the variables	The aim of PCA is to explain as much of the cumulative variance in the predictors (or variables) as possible
Purpose	Factor Analysis is used to understand the underlying ‘cause’ which these factors (latent or constituents) capture much of the information of a set of variables in the dataset data	PCA is used to decompose the data into a smaller number of components and therefore is a type of Singular Value Decomposition (SVD)
Interpretation of coefficients	The coefficients (loadings) in the factor analysis express the relationship or association of each variable (X) to the underlying factor (F)	The coefficients indicate which component contributes more to the target variable, Y, as the independent variables are standardized

An Example: Data Exploration

prodsurvey <- read_csv("Factor-Hair-Revised.csv")
head(prodsurvey)

# A tibble: 6 × 13
     ID ProdQual  Ecom TechSup CompRes Adver…¹ ProdL…² Sales…³ ComPr…⁴
  <dbl>    <dbl> <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1     1      8.5   3.9     2.5     5.9     4.8     4.9     6       6.8
2     2      8.2   2.7     5.1     7.2     3.4     7.9     3.1     5.3
3     3      9.2   3.4     5.6     5.6     5.4     7.4     5.8     4.5
4     4      6.4   3.3     7       3.7     4.7     4.7     4.5     8.8
5     5      9     3.4     5.2     4.6     2.2     6       4.5     6.8
6     6      6.5   2.8     3.1     4.1     4       4.3     3.7     8.5
# … with 4 more variables: WartyClaim <dbl>, OrdBilling <dbl>,
#   DelSpeed <dbl>, Satisfaction <dbl>, and abbreviated variable
#   names ¹Advertising, ²ProdLine, ³SalesFImage, ⁴ComPricing

dim(prodsurvey)

[1] 100  13

str(prodsurvey)

spc_tbl_ [100 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ID          : num [1:100] 1 2 3 4 5 6 7 8 9 10 ...
 $ ProdQual    : num [1:100] 8.5 8.2 9.2 6.4 9 6.5 6.9 6.2 5.8 6.4 ...
 $ Ecom        : num [1:100] 3.9 2.7 3.4 3.3 3.4 2.8 3.7 3.3 3.6 4.5 ...
 $ TechSup     : num [1:100] 2.5 5.1 5.6 7 5.2 3.1 5 3.9 5.1 5.1 ...
 $ CompRes     : num [1:100] 5.9 7.2 5.6 3.7 4.6 4.1 2.6 4.8 6.7 6.1 ...
 $ Advertising : num [1:100] 4.8 3.4 5.4 4.7 2.2 4 2.1 4.6 3.7 4.7 ...
 $ ProdLine    : num [1:100] 4.9 7.9 7.4 4.7 6 4.3 2.3 3.6 5.9 5.7 ...
 $ SalesFImage : num [1:100] 6 3.1 5.8 4.5 4.5 3.7 5.4 5.1 5.8 5.7 ...
 $ ComPricing  : num [1:100] 6.8 5.3 4.5 8.8 6.8 8.5 8.9 6.9 9.3 8.4 ...
 $ WartyClaim  : num [1:100] 4.7 5.5 6.2 7 6.1 5.1 4.8 5.4 5.9 5.4 ...
 $ OrdBilling  : num [1:100] 5 3.9 5.4 4.3 4.5 3.6 2.1 4.3 4.4 4.1 ...
 $ DelSpeed    : num [1:100] 3.7 4.9 4.5 3 3.5 3.3 2 3.7 4.6 4.4 ...
 $ Satisfaction: num [1:100] 8.2 5.7 8.9 4.8 7.1 4.7 5.7 6.3 7 5.5 ...
 - attr(*, "spec")=
  .. cols(
  ..   ID = col_double(),
  ..   ProdQual = col_double(),
  ..   Ecom = col_double(),
  ..   TechSup = col_double(),
  ..   CompRes = col_double(),
  ..   Advertising = col_double(),
  ..   ProdLine = col_double(),
  ..   SalesFImage = col_double(),
  ..   ComPricing = col_double(),
  ..   WartyClaim = col_double(),
  ..   OrdBilling = col_double(),
  ..   DelSpeed = col_double(),
  ..   Satisfaction = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

names(prodsurvey)

 [1] "ID"           "ProdQual"     "Ecom"         "TechSup"     
 [5] "CompRes"      "Advertising"  "ProdLine"     "SalesFImage" 
 [9] "ComPricing"   "WartyClaim"   "OrdBilling"   "DelSpeed"    
[13] "Satisfaction"

describe(prodsurvey)

             vars   n  mean    sd median trimmed   mad min   max
ID              1 100 50.50 29.01  50.50   50.50 37.06 1.0 100.0
ProdQual        2 100  7.81  1.40   8.00    7.85  1.78 5.0  10.0
Ecom            3 100  3.67  0.70   3.60    3.63  0.52 2.2   5.7
TechSup         4 100  5.36  1.53   5.40    5.40  1.85 1.3   8.5
CompRes         5 100  5.44  1.21   5.45    5.46  1.26 2.6   7.8
Advertising     6 100  4.01  1.13   4.00    4.00  1.19 1.9   6.5
ProdLine        7 100  5.80  1.32   5.75    5.81  1.56 2.3   8.4
SalesFImage     8 100  5.12  1.07   4.90    5.09  0.89 2.9   8.2
ComPricing      9 100  6.97  1.55   7.10    7.02  1.93 3.7   9.9
WartyClaim     10 100  6.04  0.82   6.10    6.04  0.89 4.1   8.1
OrdBilling     11 100  4.28  0.93   4.40    4.31  0.74 2.0   6.7
DelSpeed       12 100  3.89  0.73   3.90    3.92  0.74 1.6   5.5
Satisfaction   13 100  6.92  1.19   7.05    6.90  1.33 4.7   9.9
             range  skew kurtosis   se
ID            99.0  0.00    -1.24 2.90
ProdQual       5.0 -0.24    -1.17 0.14
Ecom           3.5  0.64     0.57 0.07
TechSup        7.2 -0.20    -0.63 0.15
CompRes        5.2 -0.13    -0.66 0.12
Advertising    4.6  0.04    -0.94 0.11
ProdLine       6.1 -0.09    -0.60 0.13
SalesFImage    5.3  0.37     0.26 0.11
ComPricing     6.2 -0.23    -0.96 0.15
WartyClaim     4.0  0.01    -0.53 0.08
OrdBilling     4.7 -0.32     0.11 0.09
DelSpeed       3.9 -0.45     0.09 0.07
Satisfaction   5.2  0.08    -0.86 0.12

datamatrix1 <- cor(prodsurvey[,-1])
corrplot(datamatrix1, method="number")

As we can see from the above correlation matrix:

Complaint resolution (CompRes) and delivery speed (DelSpeed) are highly correlated
Order billing (OrdBilling) and CompRes are highly correlated
WartyClaim and TechSupport are highly correlated
OrdBilling and DelSpeed are highly correlated
e-commerce (Ecom) and sales force image (SalesFImage) are highly correlated

pcor(prodsurvey[,-1], method="pearson")

$estimate
                 ProdQual          Ecom      TechSup      CompRes
ProdQual      1.000000000  0.1549597037  0.002424222 -0.056354142
Ecom          0.154959704  1.0000000000  0.082359000 -0.033513777
TechSup       0.002424222  0.0823590001  1.000000000  0.143603415
CompRes      -0.056354142 -0.0335137768  0.143603415  1.000000000
Advertising   0.112376746 -0.0002972504 -0.059300254 -0.064806093
ProdLine      0.281144724  0.1538660545 -0.125349050  0.020305650
SalesFImage  -0.376228551  0.7321011890 -0.093310796 -0.022150933
ComPricing   -0.014021386  0.0149131857 -0.132972485 -0.004487151
WartyClaim   -0.042752416 -0.1232553822  0.787729606 -0.109686211
OrdBilling    0.054523215  0.1546990521 -0.165914724  0.288344382
DelSpeed     -0.335084210 -0.0083917930 -0.021914916  0.528932328
Satisfaction  0.607438787 -0.3308440834  0.055111867  0.172416347
               Advertising    ProdLine  SalesFImage   ComPricing
ProdQual      0.1123767461  0.28114472 -0.376228551 -0.014021386
Ecom         -0.0002972504  0.15386605  0.732101189  0.014913186
TechSup      -0.0593002540 -0.12534905 -0.093310796 -0.132972485
CompRes      -0.0648060931  0.02030565 -0.022150933 -0.004487151
Advertising   1.0000000000 -0.13192319  0.262879343 -0.063298038
ProdLine     -0.1319231866  1.00000000 -0.230170570 -0.361757904
SalesFImage   0.2628793429 -0.23017057  1.000000000  0.126612489
ComPricing   -0.0632980381 -0.36175790  0.126612489  1.000000000
WartyClaim    0.0275909587  0.25718360  0.189300271  0.020351554
OrdBilling   -0.0326580430 -0.28098068 -0.182359301 -0.086500869
DelSpeed      0.2046544064  0.50189505  0.005889925  0.190437993
Satisfaction -0.0449735411  0.18325156  0.660251850 -0.087467711
              WartyClaim  OrdBilling     DelSpeed Satisfaction
ProdQual     -0.04275242  0.05452322 -0.335084210   0.60743879
Ecom         -0.12325538  0.15469905 -0.008391793  -0.33084408
TechSup       0.78772961 -0.16591472 -0.021914916   0.05511187
CompRes      -0.10968621  0.28834438  0.528932328   0.17241635
Advertising   0.02759096 -0.03265804  0.204654406  -0.04497354
ProdLine      0.25718360 -0.28098068  0.501895051   0.18325156
SalesFImage   0.18930027 -0.18235930  0.005889925   0.66025185
ComPricing    0.02035155 -0.08650087  0.190437993  -0.08746771
WartyClaim    1.00000000  0.25984451 -0.091649002  -0.08868071
OrdBilling    0.25984451  1.00000000  0.350530212   0.14880055
DelSpeed     -0.09164900  0.35053021  1.000000000   0.08955614
Satisfaction -0.08868071  0.14880055  0.089556143   1.00000000

$p.value
                 ProdQual         Ecom      TechSup      CompRes
ProdQual     0.000000e+00 1.447435e-01 9.819081e-01 5.977987e-01
Ecom         1.447435e-01 0.000000e+00 4.402823e-01 7.538375e-01
TechSup      9.819081e-01 4.402823e-01 0.000000e+00 1.769171e-01
CompRes      5.977987e-01 7.538375e-01 1.769171e-01 0.000000e+00
Advertising  2.916350e-01 9.977814e-01 5.787601e-01 5.439511e-01
ProdLine     7.269030e-03 1.476353e-01 2.391187e-01 8.493380e-01
SalesFImage  2.576212e-04 2.442012e-16 3.817042e-01 8.358301e-01
ComPricing   8.956443e-01 8.890480e-01 2.115149e-01 9.665194e-01
WartyClaim   6.890839e-01 2.471182e-01 3.262868e-20 3.034147e-01
OrdBilling   6.097697e-01 1.454288e-01 1.180864e-01 5.850710e-03
DelSpeed     1.245192e-03 9.374303e-01 8.375553e-01 8.359069e-08
Satisfaction 2.182238e-10 1.447603e-03 6.059096e-01 1.041593e-01
             Advertising     ProdLine  SalesFImage   ComPricing
ProdQual      0.29163501 7.269030e-03 2.576212e-04 0.8956443272
Ecom          0.99778145 1.476353e-01 2.442012e-16 0.8890479591
TechSup       0.57876015 2.391187e-01 3.817042e-01 0.2115148546
CompRes       0.54395113 8.493380e-01 8.358301e-01 0.9665194173
Advertising   0.00000000 2.151737e-01 1.230672e-02 0.5533828069
ProdLine      0.21517368 0.000000e+00 2.907563e-02 0.0004593443
SalesFImage   0.01230672 2.907563e-02 0.000000e+00 0.2343791374
ComPricing    0.55338281 4.593443e-04 2.343791e-01 0.0000000000
WartyClaim    0.79629803 1.440249e-02 7.394541e-02 0.8490014635
OrdBilling    0.75993047 7.304606e-03 8.538047e-02 0.4175555780
DelSpeed      0.05300070 4.664278e-07 9.560619e-01 0.0721942454
Satisfaction  0.67382405 8.383625e-02 1.449719e-12 0.4123500152
               WartyClaim   OrdBilling     DelSpeed Satisfaction
ProdQual     6.890839e-01 0.6097697424 1.245192e-03 2.182238e-10
Ecom         2.471182e-01 0.1454287628 9.374303e-01 1.447603e-03
TechSup      3.262868e-20 0.1180864174 8.375553e-01 6.059096e-01
CompRes      3.034147e-01 0.0058507099 8.359069e-08 1.041593e-01
Advertising  7.962980e-01 0.7599304715 5.300070e-02 6.738241e-01
ProdLine     1.440249e-02 0.0073046065 4.664278e-07 8.383625e-02
SalesFImage  7.394541e-02 0.0853804743 9.560619e-01 1.449719e-12
ComPricing   8.490015e-01 0.4175555780 7.219425e-02 4.123500e-01
WartyClaim   0.000000e+00 0.0133877361 3.902770e-01 4.058729e-01
OrdBilling   1.338774e-02 0.0000000000 7.064114e-04 1.615975e-01
DelSpeed     3.902770e-01 0.0007064114 0.000000e+00 4.012356e-01
Satisfaction 4.058729e-01 0.1615974575 4.012356e-01 0.000000e+00

$statistic
                ProdQual         Ecom     TechSup     CompRes
ProdQual      0.00000000  1.471424516  0.02274128 -0.52949015
Ecom          1.47142452  0.000000000  0.77522957 -0.31456380
TechSup       0.02274128  0.775229571  0.00000000  1.36122814
CompRes      -0.52949015 -0.314563798  1.36122814  0.00000000
Advertising   1.06090746 -0.002788456 -0.55726637 -0.60921569
ProdLine      2.74821968  1.460787002 -1.18522655  0.19052316
SalesFImage  -3.80921125 10.081854496 -0.87916864 -0.20784517
ComPricing   -0.13154519  0.139913642 -1.25856891 -0.04209363
WartyClaim   -0.40142023 -1.165122048 11.99562484 -1.03519395
OrdBilling    0.51223505  1.468888754 -1.57829305  2.82489237
DelSpeed     -3.33624278 -0.078724768 -0.20562952  5.84663073
Satisfaction  7.17336519 -3.288799958  0.51778207  1.64199901
              Advertising   ProdLine SalesFImage  ComPricing
ProdQual      1.060907458  2.7482197 -3.80921125 -0.13154519
Ecom         -0.002788456  1.4607870 10.08185450  0.13991364
TechSup      -0.557266374 -1.1852266 -0.87916864 -1.25856891
CompRes      -0.609215688  0.1905232 -0.20784517 -0.04209363
Advertising   0.000000000 -1.2484608  2.55592188 -0.59498137
ProdLine     -1.248460807  0.0000000 -2.21876450 -3.64012829
SalesFImage   2.555921881 -2.2187645  0.00000000  1.19736653
ComPricing   -0.594981366 -3.6401283  1.19736653  0.00000000
WartyClaim    0.258924708  2.4965744  1.80849286  0.19095404
OrdBilling   -0.306523103 -2.7464786 -1.73985585 -0.81450302
DelSpeed      1.961341689  5.4434474  0.05525335  1.81976992
Satisfaction -0.422316520  1.7486638  8.24679932 -0.82367672
             WartyClaim OrdBilling    DelSpeed Satisfaction
ProdQual     -0.4014202  0.5122350 -3.33624278    7.1733652
Ecom         -1.1651220  1.4688888 -0.07872477   -3.2888000
TechSup      11.9956248 -1.5782930 -0.20562952    0.5177821
CompRes      -1.0351939  2.8248924  5.84663073    1.6419990
Advertising   0.2589247 -0.3065231  1.96134169   -0.4223165
ProdLine      2.4965744 -2.7464786  5.44344736    1.7486638
SalesFImage   1.8084929 -1.7398559  0.05525335    8.2467993
ComPricing    0.1909540 -0.8145030  1.81976992   -0.8236767
WartyClaim    0.0000000  2.5242649 -0.86337748   -0.8351894
OrdBilling    2.5242649  0.0000000  3.51103503    1.4115877
DelSpeed     -0.8633775  3.5110350  0.00000000    0.8435005
Satisfaction -0.8351894  1.4115877  0.84350046    0.0000000

$n
[1] 100

$gp
[1] 10

$method
[1] "pearson"

We can see that there is a high degree of collinearity between the independent variables.

Multiple Linear Regression

model1 <- lm(Satisfaction ~ ., prodsurvey[,-1])
summary(model1)


Call:
lm(formula = Satisfaction ~ ., data = prodsurvey[, -1])

Residuals:
     Min       1Q   Median       3Q      Max 
-1.43005 -0.31165  0.07621  0.37190  0.90120 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.66961    0.81233  -0.824  0.41199    
ProdQual     0.37137    0.05177   7.173 2.18e-10 ***
Ecom        -0.44056    0.13396  -3.289  0.00145 ** 
TechSup      0.03299    0.06372   0.518  0.60591    
CompRes      0.16703    0.10173   1.642  0.10416    
Advertising -0.02602    0.06161  -0.422  0.67382    
ProdLine     0.14034    0.08025   1.749  0.08384 .  
SalesFImage  0.80611    0.09775   8.247 1.45e-12 ***
ComPricing  -0.03853    0.04677  -0.824  0.41235    
WartyClaim  -0.10298    0.12330  -0.835  0.40587    
OrdBilling   0.14635    0.10367   1.412  0.16160    
DelSpeed     0.16570    0.19644   0.844  0.40124    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5623 on 88 degrees of freedom
Multiple R-squared:  0.8021,    Adjusted R-squared:  0.7774 
F-statistic: 32.43 on 11 and 88 DF,  p-value: < 2.2e-16

As in our model the adjusted R-squared: 0.7774, meaning that independent variables explain 78% of the variance of the dependent variable, only 3 variables are significant out of 11 independent variables.
The p-value of the F-statistic is less than 0.05 (level of Significance), which means our model is significant. This means that, at least, one of the predictor variables is significantly related to the outcome variable.
Our model equation can be written as:
Satisfaction = -0.66 + 0.37 x ProdQual -0.44 x Ecom + 0.034 x TechSup + 0.16 x CompRes -0.02 x Advertising + 0.14 x ProdLine + 0.80 x SalesFImage - 0.038 x CompPricing - 0.10 x WartyClaim + 0.14 x OrdBilling + 0.16 x DelSpeed

Variable Inflation Factor (VIF)

vif(model1)

   ProdQual        Ecom     TechSup     CompRes Advertising 
   1.635797    2.756694    2.976796    4.730448    1.508933 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
   3.488185    3.439420    1.635000    3.198337    2.902999 
   DelSpeed 
   6.516014

High Variable Inflation Factor (VIF) is a sign of multicollinearity. There is no formal VIF value for determining the presence of multicollinearity; however, in weaker models, VIF value greater than 2.5 may be a cause of concern.
From the VIF values, we can infer that variables CompRes and DelSpeed are a cause of concern.
Remedial Measures:
Two of the most commonly used methods to deal with multicollinearity in the model is the following.
- Remove some of the highly correlated variables using VIF or stepwise algorithms.
- Perform an analysis design like principal component analysis (PCA)/ Factor Analysis on the correlated variables.

Factor Analysis

X <- prodsurvey[,-c(1,13)]
y <- prodsurvey[,13]

Let’s check the factorability of the variables in the dataset: perform the Kaiser-Meyer-Olkin (KMO) Test.

Kaiser-Meyer-Olkin (KMO) Test and/or Bartlett’s Test is a measure of how suited your data is for Factor Analysis. The test measures sampling adequacy for each variable in the model and for the complete model. The statistic is a measure of the proportion of variance among variables that might be common variance. The lower the proportion, the more suited your data is to Factor Analysis.

datamatrix2 <- cor(X)
KMO(r=datamatrix2)

Kaiser-Meyer-Olkin factor adequacy
Call: KMO(r = datamatrix2)
Overall MSA =  0.65
MSA for each item = 
   ProdQual        Ecom     TechSup     CompRes Advertising 
       0.51        0.63        0.52        0.79        0.78 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
       0.62        0.62        0.75        0.51        0.76 
   DelSpeed 
       0.67

MSA (measure of sampling adequacy) > 0.5, we can run Factor Analysis on this data.

print(cortest.bartlett(datamatrix2, nrow(prodsurvey[,-1])))

$chisq
[1] 619.2726

$p.value
[1] 1.79337e-96

$df
[1] 55

The approximate of Chi-square is 619.27 with 55 degrees of freedom, which is significant at 0.05 Level of significance.

Hence Factor Analysis is considered as an appropriate technique for further analysis of the data.

Find the optimal clusters

Plotting the eigenvalues from our factor analysis (whether it’s based on principal axis or principal components extraction), a parallel analysis involves generating random correlation matrices and after factor analyzing them, comparing the resulting eigenvalues to the eigenvalues of the observed data. The idea behind this method is thatobserved eigenvalues that are higher than their corresponding random eigenvalues are more likely to be from “meaningful factors” than observed eigenvalues that are below their corresponding random eigenvalue.

parallel <- fa.parallel(X)

Parallel analysis suggests that the number of factors =  3  and the number of components =  3

# other option of scree plot
scree(X)

Selection of factors from the scree plot can be based on:

Kaiser-Guttman normalization rule says that we should choose all factors with an eigenvalue greater than 1.
Bend elbow rule

When looking at the parallel analysis scree plots, there are two places to look depending on which type of factor analysis you’re looking to run. The two blue lines show you the observed eigenvalues - they should look identical to the scree plots drawn by the scree function. The red dotted lines show the random eigenvalues or the simulated data line. Each point on the blue line that lies above the corresponding simulated data line is a factor or component to extract. In this analysis, you can see that 4 factors in the “Factor Analysis” parallel analysis lie above the corresponding simulated data line and 3 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line.
In our case, however, the last factor/component lies very close to the line - for both principal components extraction and principal axis extraction. Thus, we should probably compare the 3 factor and 4 factor solutions, to see which one is most interpretable.

The scree plot graphs the Eigenvalue against each factor. We can see from the graph that after factor 4 there is a sharp change in the curvature of the scree plot. This shows that after factor 4 the total variance accounts for smaller amounts.

ev <- eigen(cor(X))
ev$values

 [1] 3.42697133 2.55089671 1.69097648 1.08655606 0.60942409 0.55188378
 [7] 0.40151815 0.24695154 0.20355327 0.13284158 0.09842702

Factor <- rep(1:11)
Eigen_Values <- ev$values
Scree <- data.frame(Factor, Eigen_Values)

ggplot(data = Scree, mapping = aes(x = Factor, y = Eigen_Values)) + 
  geom_point() +
  geom_line() +
  scale_y_continuous(name = "Eigen Values", limits = c(0,4)) +
  theme_bw() + 
  theme(panel.grid.major.y = element_line(color = "skyblue")) +
  ggtitle("Scree Plot")

So as per the elbow or Kaiser-Guttman normalization rule, we are good to go ahead with 4 factors.

Conducting the Factor Analysis

Factor analysis using fa method:

fa.none <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="none") # none rotation
print(fa.none)

Factor Analysis using method =  pa
Call: fa(r = X, nfactors = 4, rotate = "none", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
              PA1   PA2   PA3   PA4   h2    u2 com
ProdQual     0.20 -0.41 -0.06  0.46 0.42 0.576 2.4
Ecom         0.29  0.66  0.27  0.22 0.64 0.362 2.0
TechSup      0.28 -0.38  0.74 -0.17 0.79 0.205 1.9
CompRes      0.86  0.01 -0.26 -0.18 0.84 0.157 1.3
Advertising  0.29  0.46  0.08  0.13 0.31 0.686 1.9
ProdLine     0.69 -0.45 -0.14  0.31 0.80 0.200 2.3
SalesFImage  0.39  0.80  0.35  0.25 0.98 0.021 2.1
ComPricing  -0.23  0.55 -0.04 -0.29 0.44 0.557 1.9
WartyClaim   0.38 -0.32  0.74 -0.15 0.81 0.186 2.0
OrdBilling   0.75  0.02 -0.18 -0.18 0.62 0.378 1.2
DelSpeed     0.90  0.10 -0.30 -0.20 0.94 0.058 1.4

                       PA1  PA2  PA3  PA4
SS loadings           3.21 2.22 1.50 0.68
Proportion Var        0.29 0.20 0.14 0.06
Cumulative Var        0.29 0.49 0.63 0.69
Proportion Explained  0.42 0.29 0.20 0.09
Cumulative Proportion 0.42 0.71 0.91 1.00

Mean item complexity =  1.9
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
The degrees of freedom for the model are 17  and the objective function was  0.33 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 

Tucker Lewis Index of factoring reliability =  0.921
RMSEA index =  0.088  and the 90 % confidence intervals are  0.032 0.139
BIC =  -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA2  PA3  PA4
Correlation of (regression) scores with factors   0.98 0.97 0.95 0.88
Multiple R square of scores with factors          0.96 0.95 0.91 0.78
Minimum correlation of possible factor scores     0.92 0.90 0.82 0.56

Factor analysis using the factanal method:

factanal.none <- factanal(X, factors=4, scores = c("regression"), rotation = "none")
print(factanal.none)


Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "none")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising 
      0.682       0.360       0.228       0.178       0.679 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
      0.005       0.017       0.636       0.163       0.347 
   DelSpeed 
      0.076 

Loadings:
            Factor1 Factor2 Factor3 Factor4
ProdQual     0.467  -0.148  -0.229  -0.159 
Ecom                 0.791                 
TechSup      0.198          -0.482   0.707 
CompRes      0.589   0.316   0.535   0.300 
Advertising          0.556   0.110         
ProdLine     0.997                         
SalesFImage          0.987                 
ComPricing  -0.491   0.252   0.238         
WartyClaim   0.279   0.124  -0.487   0.712 
OrdBilling   0.452   0.275   0.493   0.360 
DelSpeed     0.629   0.364   0.582   0.241 

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.522   2.316   1.469   1.322
Proportion Var   0.229   0.211   0.134   0.120
Cumulative Var   0.229   0.440   0.573   0.694

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113

factanal.var <- factanal(X, factors=4, scores = c("regression"), rotation = "varimax")
print(factanal.var)


Call:
factanal(x = X, factors = 4, scores = c("regression"), rotation = "varimax")

Uniquenesses:
   ProdQual        Ecom     TechSup     CompRes Advertising 
      0.682       0.360       0.228       0.178       0.679 
   ProdLine SalesFImage  ComPricing  WartyClaim  OrdBilling 
      0.005       0.017       0.636       0.163       0.347 
   DelSpeed 
      0.076 

Loadings:
            Factor1 Factor2 Factor3 Factor4
ProdQual                             0.557 
Ecom                 0.793                 
TechSup                      0.872   0.102 
CompRes      0.884   0.142           0.135 
Advertising  0.190   0.521          -0.110 
ProdLine     0.502           0.104   0.856 
SalesFImage  0.119   0.974          -0.130 
ComPricing           0.225  -0.216  -0.514 
WartyClaim                   0.894   0.158 
OrdBilling   0.794   0.101   0.105         
DelSpeed     0.928   0.189           0.164 

               Factor1 Factor2 Factor3 Factor4
SS loadings      2.592   1.977   1.638   1.423
Proportion Var   0.236   0.180   0.149   0.129
Cumulative Var   0.236   0.415   0.564   0.694

Test of the hypothesis that 4 factors are sufficient.
The chi square statistic is 24.26 on 17 degrees of freedom.
The p-value is 0.113

Graph Factor Loading Matrices

Factor analysis results are typically interpreted in terms of the major loadings on each factor. These structures may be represented an “interpretable” solution as a table of loadings or graphically, where all loadings with an absolute value \(\rightarrow\) some cut point are represented as an edge (path).

fa.diagram(fa.none)

The first 4 factors have an Eigenvalue >1 and which explains almost 69% of the variance. We can effectively reduce dimensionality from 11 to 4 while only losing about 31% of the variance.
Factor 1 accounts for 29.20% of the variance; Factor 2 accounts for 20.20% of the variance; Factor 3 accounts for 13.60% of the variance; Factor 4 accounts for 6% of the variance. All the 4 factors together explain for 69% of the variance in performance.

fa.var <-  fa(r=X, 
              nfactors = 4, 
#              covar = FALSE, SMC = TRUE,
              fm="pa", # type of factor analysis we want to use (“pa” is principal axis factoring)
              max.iter=100, # (50 is the default, but we have changed it to 100
              rotate="varimax") # none rotation
print(fa.var)

Factor Analysis using method =  pa
Call: fa(r = X, nfactors = 4, rotate = "varimax", max.iter = 100, fm = "pa")
Standardized loadings (pattern matrix) based upon correlation matrix
              PA1   PA2   PA3   PA4   h2    u2 com
ProdQual     0.02 -0.07  0.02  0.65 0.42 0.576 1.0
Ecom         0.07  0.79  0.03 -0.11 0.64 0.362 1.1
TechSup      0.02 -0.03  0.88  0.12 0.79 0.205 1.0
CompRes      0.90  0.13  0.05  0.13 0.84 0.157 1.1
Advertising  0.17  0.53 -0.04 -0.06 0.31 0.686 1.2
ProdLine     0.53 -0.04  0.13  0.71 0.80 0.200 1.9
SalesFImage  0.12  0.97  0.06 -0.13 0.98 0.021 1.1
ComPricing  -0.08  0.21 -0.21 -0.59 0.44 0.557 1.6
WartyClaim   0.10  0.06  0.89  0.13 0.81 0.186 1.1
OrdBilling   0.77  0.13  0.09  0.09 0.62 0.378 1.1
DelSpeed     0.95  0.19  0.00  0.09 0.94 0.058 1.1

                       PA1  PA2  PA3  PA4
SS loadings           2.63 1.97 1.64 1.37
Proportion Var        0.24 0.18 0.15 0.12
Cumulative Var        0.24 0.42 0.57 0.69
Proportion Explained  0.35 0.26 0.22 0.18
Cumulative Proportion 0.35 0.60 0.82 1.00

Mean item complexity =  1.2
Test of the hypothesis that 4 factors are sufficient.

The degrees of freedom for the null model are  55  and the objective function was  6.55 with Chi Square of  619.27
The degrees of freedom for the model are 17  and the objective function was  0.33 

The root mean square of the residuals (RMSR) is  0.02 
The df corrected root mean square of the residuals is  0.03 

The harmonic number of observations is  100 with the empirical chi square  3.19  with prob <  1 
The total number of observations was  100  with Likelihood Chi Square =  30.27  with prob <  0.024 

Tucker Lewis Index of factoring reliability =  0.921
RMSEA index =  0.088  and the 90 % confidence intervals are  0.032 0.139
BIC =  -48.01
Fit based upon off diagonal values = 1
Measures of factor score adequacy             
                                                   PA1  PA2  PA3  PA4
Correlation of (regression) scores with factors   0.98 0.99 0.94 0.88
Multiple R square of scores with factors          0.96 0.97 0.88 0.78
Minimum correlation of possible factor scores     0.93 0.94 0.77 0.55

fa.var$loadings


Loadings:
            PA1    PA2    PA3    PA4   
ProdQual                          0.647
Ecom                0.787        -0.113
TechSup                    0.883  0.116
CompRes      0.898  0.130         0.132
Advertising  0.166  0.530              
ProdLine     0.525         0.127  0.712
SalesFImage  0.115  0.971        -0.135
ComPricing          0.213 -0.209 -0.590
WartyClaim   0.103         0.885  0.128
OrdBilling   0.768  0.127              
DelSpeed     0.949  0.185              

                 PA1   PA2   PA3   PA4
SS loadings    2.635 1.967 1.641 1.371
Proportion Var 0.240 0.179 0.149 0.125
Cumulative Var 0.240 0.418 0.568 0.692

fa.diagram(fa.var)

The red dotted line means that Competitive Pricing marginally falls under the PA4 bucket and the loading are negative.

Labeling and interpretation of the factors

Factors	Variables	Label	Short Interpretaation
PA1	DelSpeed, CompRes, OrdBilling	Purchase	All the items are related to purchasing the product; from placing order to billing and getting it delivered.
PA2	SalesFImage, Ecom, Advertising	Marketing	In this factors the item are related to marketing processes like the image of sales force and spending on adverstising
PA3	WartyClaim, TechSup	Post Purchase	Post purchase activities are included in this factor; like warranty claims and technical support.
PA4	ProdLine, ProdQual, CompPricing	Product Position	Product positioning related items are grouped in this factor.

Scores for all the rows

head(fa.var$scores)

            PA1        PA2          PA3         PA4
[1,] -0.1338871  0.9175166 -1.719604873  0.09135411
[2,]  1.6297604 -2.0090053 -0.596361722  0.65808192
[3,]  0.3637658  0.8361736  0.002979966  1.37548765
[4,] -1.2225230 -0.5491336  1.245473305 -0.64421384
[5,] -0.4854209 -0.4276223 -0.026980304  0.47360747
[6,] -0.5950924 -1.3035333 -1.183019401 -0.95913571

#fa.var$scores # for 100 observations

Regression analysis using the factors scores as the independent variable

regdata <- cbind(prodsurvey["Satisfaction"], fa.var$scores)

#Labeling the data
names(regdata) <- c("Satisfaction", "Purchase", "Marketing",
                    "Post_purchase", "Prod_positioning")
head(regdata)

  Satisfaction   Purchase  Marketing Post_purchase Prod_positioning
1          8.2 -0.1338871  0.9175166  -1.719604873       0.09135411
2          5.7  1.6297604 -2.0090053  -0.596361722       0.65808192
3          8.9  0.3637658  0.8361736   0.002979966       1.37548765
4          4.8 -1.2225230 -0.5491336   1.245473305      -0.64421384
5          7.1 -0.4854209 -0.4276223  -0.026980304       0.47360747
6          4.7 -0.5950924 -1.3035333  -1.183019401      -0.95913571

Splitting the data to train and test set

#Splitting the data 70:30
#Random number generator, set seed.
set.seed(100)
indices= sample(1:nrow(regdata), 0.7*nrow(regdata))
train=regdata[indices,]
test = regdata[-indices,]

Regression Model using train data

model.fa.score = lm(Satisfaction~., train)
summary(model.fa.score)


Call:
lm(formula = Satisfaction ~ ., data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.76280 -0.48717  0.06799  0.46459  1.24022 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.91852    0.08068  85.757  < 2e-16 ***
Purchase          0.50230    0.07641   6.574 9.74e-09 ***
Marketing         0.75488    0.08390   8.998 5.00e-13 ***
Post_purchase     0.08755    0.08216   1.066    0.291    
Prod_positioning  0.58074    0.08781   6.614 8.30e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6629 on 65 degrees of freedom
Multiple R-squared:  0.7261,    Adjusted R-squared:  0.7093 
F-statistic: 43.08 on 4 and 65 DF,  p-value: < 2.2e-16

Check vif

vif(model.fa.score)

        Purchase        Marketing    Post_purchase Prod_positioning 
        1.009648         1.008235         1.015126         1.024050

As per the VIF values, we don’t have multicollinearity in the model

Check prediction of the model in the test dataset

#Model Performance metrics:
pred_test <- predict(model.fa.score, newdata = test, type = "response")
pred_test

       6        8       11       13       17       19       26 
4.975008 5.908267 6.951629 8.677431 6.613838 6.963113 6.313513 
      33       34       35       37       40       42       44 
6.141338 6.158993 7.415742 6.589746 6.858206 7.133989 8.533080 
      49       50       53       56       57       60       65 
8.765145 8.078744 7.395438 7.468360 8.744402 6.276660 5.936570 
      67       71       73       75       80       96       97 
6.650322 8.299545 7.685564 7.330191 6.719528 7.540233 6.143172 
      99      100 
8.084583 5.799897

#Find MSE and MAPE scores:
#MSE/ MAPE of Model1
test$Satisfaction_Predicted <- pred_test
head(test[c("Satisfaction","Satisfaction_Predicted")], 10)

   Satisfaction Satisfaction_Predicted
6           4.7               4.975008
8           6.3               5.908267
11          7.4               6.951629
13          8.4               8.677431
17          6.4               6.613838
19          6.8               6.963113
26          6.6               6.313513
33          5.4               6.141338
34          7.3               6.158993
35          6.3               7.415742

test_r2 <- cor(test$Satisfaction, test$Satisfaction_Predicted) ^2

mse_test <- mse(test$Satisfaction, pred_test)
rmse_test <- sqrt(mse(test$Satisfaction, pred_test))
mape_test <- mape(test$Satisfaction, pred_test)
model_metrics <- cbind(mse_test,rmse_test,mape_test,test_r2)
print(model_metrics, 3)

     mse_test rmse_test mape_test test_r2
[1,]    0.551     0.742    0.0875   0.554

As the feature Post_purchase is not significant so we will drop this feature and then let’s run the regression model again.

##Regression model without post_purchase:
model2 <- lm(Satisfaction ~ Purchase+ Marketing+ 
                Prod_positioning, data= train)
summary(model2)


Call:
lm(formula = Satisfaction ~ Purchase + Marketing + Prod_positioning, 
    data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.7268 -0.5036  0.1117  0.4408  1.2962 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.92158    0.08071  85.761  < 2e-16 ***
Purchase          0.49779    0.07637   6.518 1.15e-08 ***
Marketing         0.75870    0.08391   9.042 3.66e-13 ***
Prod_positioning  0.59084    0.08739   6.761 4.30e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6636 on 66 degrees of freedom
Multiple R-squared:  0.7213,    Adjusted R-squared:  0.7087 
F-statistic: 56.95 on 3 and 66 DF,  p-value: < 2.2e-16

pred_test2 <- predict(model2, newdata = test, type = "response")
pred_test2

       6        8       11       13       17       19       26 
5.069667 5.958765 7.001694 8.695630 6.521440 6.994018 6.249667 
      33       34       35       37       40       42       44 
6.123357 6.132165 7.415882 6.566556 6.964570 7.233376 8.424720 
      49       50       53       56       57       60       65 
8.853060 8.177009 7.413693 7.404078 8.746511 6.215816 5.893515 
      67       71       73       75       80       96       97 
6.476465 8.227837 7.839941 7.354907 6.734279 7.559995 6.293774 
      99      100 
8.083909 5.832094

test$Satisfaction_Predicted2 <- pred_test2
head(test[c(1,7)], 10)

   Satisfaction Satisfaction_Predicted2
6           4.7                5.069667
8           6.3                5.958765
11          7.4                7.001694
13          8.4                8.695630
17          6.4                6.521440
19          6.8                6.994018
26          6.6                6.249667
33          5.4                6.123357
34          7.3                6.132165
35          6.3                7.415882

test_r22 <- cor(test$Satisfaction, test$Satisfaction_Predicted2) ^2
mse_test2 <- mse(test$Satisfaction, pred_test2)
rmse_test2 <- sqrt(mse(test$Satisfaction, pred_test2))
mape_test2 <- mape(test$Satisfaction, pred_test2)

model2_metrics <- cbind(mse_test2,rmse_test2,mape_test2,test_r22)
model2_metrics

     mse_test2 rmse_test2 mape_test2  test_r22
[1,] 0.5446039  0.7379728 0.08490197 0.5595099

Overall <- rbind(model_metrics,model2_metrics)
row.names(Overall) <- c("Test1", "Test2")
colnames(Overall) <- c("MSE", "RMSE", "MAPE", "R-squared")
print(Overall,3)

        MSE  RMSE   MAPE R-squared
Test1 0.551 0.742 0.0875     0.554
Test2 0.545 0.738 0.0849     0.560

Try the interaction?!

Including Interaction model, we are able to make a better prediction!??

Conclusion

We saw how Factor Analysis can be used to reduce the dimensionality of a dataset and then we used multiple linear regression on the dimensionally reduced columns/Features for further analysis/predictions.

Factor Analysis applied in Multivariate Regression: Machine Learning’s Perspective